convolution algorithm
Causes and Effects of Unanticipated Numerical Deviations in Neural Network Inference Frameworks
Hardware-specific optimizations in machine learning (ML) frameworks can cause numerical deviations of inference results. Quite surprisingly, despite using a fixed trained model and fixed input data, inference results are not consistent across platforms, and sometimes not even deterministic on the same platform. We study the causes of these numerical deviations for convolutional neural networks (CNN) on realistic end-to-end inference pipelines and in isolated experiments. Results from 75 distinct platforms suggest that the main causes of deviations on CPUs are differences in SIMD use, and the selection of convolution algorithms at runtime on GPUs. We link the causes and propagation effects to properties of the ML model and evaluate potential mitigations. We make our research code publicly available.
Reviews: Designing by Training: Acceleration Neural Network for Fast High-Dimensional Convolution
I am happy with the responses they have provided to my initial concerns which have improved the manuscript. I would encourage authors to add an appendix should they believe they can convey a more complete message without having to wait for drafting another another longer manuscript. They utilise the new Acceleration Network (AccNet) to express the approximation function of splatting/blurring/slicing (SBS) and generalise SBS by changing the architecture of AccNet. Upon finishing the training process, the proposed fast convolution algorithm can be derived from the weights and activation functions of each layer. They conducted various experiments that prove the effectiveness of the proposed algorithm.
SFC: Achieve Accurate Fast Convolution under Low-precision Arithmetic
He, Liulu, Zhao, Yufei, Gao, Rui, Du, Yuan, Du, Li
Fast convolution algorithms, including Winograd and FFT, can efficiently accelerate convolution operations in deep models. However, these algorithms depend on high-precision arithmetic to maintain inference accuracy, which conflicts with the model quantization. To resolve this conflict and further improve the efficiency of quantized convolution, we proposes SFC, a new algebra transform for fast convolution by extending the Discrete Fourier Transform (DFT) with symbolic computing, in which only additions are required to perform the transformation at specific transform points, avoiding the calculation of irrational number and reducing the requirement for precision. Additionally, we enhance convolution efficiency by introducing correction terms to convert invalid circular convolution outputs of the Fourier method into effective ones. The numerical error analysis is presented for the first time in this type of work and proves that our algorithms can provide a 3.68x multiplication reduction for 3x3 convolution, while the Winograd algorithm only achieves a 2.25x reduction with similarly low numerical errors. Experiments carried out on benchmarks and FPGA show that our new algorithms can further improve the computation efficiency of quantized models while maintaining accuracy, surpassing both the quantization-alone method and existing works on fast convolution quantization.
HyPHEN: A Hybrid Packing Method and Optimizations for Homomorphic Encryption-Based Neural Networks
Kim, Donghwan, Park, Jaiyoung, Kim, Jongmin, Kim, Sangpyo, Ahn, Jung Ho
Convolutional neural network (CNN) inference using fully homomorphic encryption (FHE) is a promising private inference (PI) solution due to the capability of FHE that enables offloading the whole computation process to the server while protecting the privacy of sensitive user data. Prior FHE-based CNN (HCNN) work has demonstrated the feasibility of constructing deep neural network architectures such as ResNet using FHE. Despite these advancements, HCNN still faces significant challenges in practicality due to the high computational and memory overhead. To overcome these limitations, we present HyPHEN, a deep HCNN construction that incorporates novel convolution algorithms (RAConv and CAConv), data packing methods (2D gap packing and PRCR scheme), and optimization techniques tailored to HCNN construction. Such enhancements enable HyPHEN to substantially reduce the memory footprint and the number of expensive homomorphic operations, such as ciphertext rotation and bootstrapping. As a result, HyPHEN brings the latency of HCNN CIFAR-10 inference down to a practical level at 1.4 seconds (ResNet-20) and demonstrates HCNN ImageNet inference for the first time at 14.7 seconds (ResNet-18).
Im2win: Memory Efficient Convolution On SIMD Architectures
Lu, Shuai, Chu, Jun, Liu, Xu T.
Convolution is the most expensive operation among neural network operations, thus its performance is critical to the overall performance of neural networks. Commonly used convolution approaches, including general matrix multiplication (GEMM)-based convolution and direct convolution, rely on im2col for data transformation or do not use data transformation at all, respectively. However, the im2col data transformation can lead to at least 2$\times$ memory footprint compared to not using data transformation at all, thus limiting the size of neural network models running on memory-limited systems. Meanwhile, not using data transformation usually performs poorly due to nonconsecutive memory access although it consumes less memory. To solve those problems, we propose a new memory-efficient data transformation algorithm, called im2win. This algorithm refactorizes a row of square or rectangle dot product windows of the input image and flattens unique elements within these windows into a row in the output tensor, which enables consecutive memory access and data reuse, and thus greatly reduces the memory overhead. Furthermore, we propose a high-performance im2win-based convolution algorithm with various optimizations, including vectorization, loop reordering, etc. Our experimental results show that our algorithm reduces the memory overhead by average to 41.6% compared to the PyTorch's convolution implementation based on im2col, and achieves average to 3.6$\times$ and 5.3$\times$ speedup in performance compared to the im2col-based convolution and not using data transformation, respectively.
Im2win: An Efficient Convolution Paradigm on GPU
Lu, Shuai, Chu, Jun, Guo, Luanzheng, Liu, Xu T.
Convolution is the most time-consuming operation in deep neural network operations, so its performance is critical to the overall performance of the neural network. The commonly used methods for convolution on GPU include the general matrix multiplication (GEMM)-based convolution and the direct convolution. GEMM-based convolution relies on the im2col algorithm, which results in a large memory footprint and reduced performance. Direct convolution does not have the large memory footprint problem, but the performance is not on par with GEMM-based approach because of the discontinuous memory access. This paper proposes a window-order-based convolution paradigm on GPU, called im2win, which not only reduces memory footprint but also offers continuous memory accesses, resulting in improved performance. Furthermore, we apply a range of optimization techniques on the convolution CUDA kernel, including shared memory, tiling, micro-kernel, double buffer, and prefetching. We compare our implementation with the direct convolution, and PyTorch's GEMM-based convolution with cuBLAS and six cuDNN-based convolution implementations, with twelve state-of-the-art DNN benchmarks. The experimental results show that our implementation 1) uses less memory footprint by 23.1% and achieves 3.5$\times$ TFLOPS compared with cuBLAS, 2) uses less memory footprint by 32.8% and achieves up to 1.8$\times$ TFLOPS compared with the best performant convolutions in cuDNN, and 3) achieves up to 155$\times$ TFLOPS compared with the direct convolution. We further perform an ablation study on the applied optimization techniques and find that the micro-kernel has the greatest positive impact on performance.
Hyperspectral Remote Sensing Image Classification Based on Multi-scale Cross Graphic Convolution
Zhao, Yunsong, Li, Yin, Chen, Zhihan, Qiu, Tianchong, Liu, Guojin
The mining and utilization of features directly affect the classification performance of models used in the classification and recognition of hyperspectral remote sensing images. Traditional models usually conduct feature mining from a single perspective, with the features mined being limited and the internal relationships between them being ignored. Consequently, useful features are lost and classification results are unsatisfactory. To fully mine and utilize image features, a new multi-scale feature-mining learning algorithm (MGRNet) is proposed. The model uses principal component analysis to reduce the dimensionality of the original hyperspectral image (HSI) to retain 99.99% of its semantic information and extract dimensionality reduction features. Using a multi-scale convolution algorithm, the input dimensionality reduction features were mined to obtain shallow features, which then served as inputs into a multi-scale graph convolution algorithm to construct the internal relationships between eigenvalues at different scales. We then carried out cross fusion of multi-scale information obtained by graph convolution, before inputting the new information obtained into the residual network algorithm for deep feature mining. Finally, a flexible maximum transfer function classifier was used to predict the final features and complete the classification. Experiments on three common hyperspectral datasets showed the MGRNet algorithm proposed in this paper to be superior to traditional methods in recognition accuracy.
cuConv: A CUDA Implementation of Convolution for CNN Inference
Jordà, Marc, Valero-Lara, Pedro, Peña, Antonio J.
Convolutions are the core operation of deep learning applications based on Convolutional Neural Networks (CNNs). Current GPU architectures are highly efficient for training and deploying deep CNNs, and hence, these are largely used in production for this purpose. State-of-the-art implementations, however, present a lack of efficiency for some commonly used network configurations. In this paper we propose a GPU-based implementation of the convolution operation for CNN inference that favors coalesced accesses, without requiring prior data transformations. Our experiments demonstrate that our proposal yields notable performance improvements in a range of common CNN forward propagation convolution configurations, with speedups of up to 2.29x with respect to the best implementation of convolution in cuDNN, hence covering a relevant region in currently existing approaches.
CVML Live Web-Lecture Series – Icarus
CVML Live Web Lecture Series Concept Artificial Intelligence and Information analysis (AIIA) Lab, AUTH is proud to launch the live CVML Web lecture series that will cover very important topics Computer vision/machine learning. Top scientists internationally will deliver these lectures, aiming at providing in-depth knowledge on various CVML topics. The 1-hour lectures will take place on Saturdays, to avoid conflicts with other intended registrant schedules/duties: a) Saturdays 11:00 EET (17:00 Beijing time) and b) Saturdays 20:00 EET (13:00 EST, 10:00 PST for NY/LA, respectively) for audience in the Americas. Each lecture will be announced at least 1 week in advance in various relevant email lists and in this page. Lectures will consist primarily of live lecture streaming and PPT slides.